Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Abstract
In this paper, we describe how the TEITOK corpus tools helped to create a diachronic corpus for Old Spanish that contains both paleographic and linguistic information, which is easy to use for nonspecialists, and in which it is easy to perform manual improvements to automatically assigned POS tags and lemmas.
Related resources
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Gothenburg, Sweden, May 22, 2017
Normalizing Medieval German Texts: from rules to deep learning
The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts
To be able to use existing natural language processing tools for analysing historical text, an important preprocessing step is spelling normalisation, converting the original spelling to present-day spelling, before applying tools such as taggers and parsers. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation based on statistical machine translatio...
OCR and post-correction of historical Finnish texts
This paper presents experiments on optical character recognition (OCR) as a combination of Ocropy software and data-driven spelling correction that uses weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.
HistoBankVis: Detecting Language Change via Data Visualization
We present HistoBankVis, a novel visualization system designed for the interactive analysis of complex, multidimensional data to facilitate historical linguistic work. In this paper, we illustrate the visualization’s efficacy and power by means of a concrete case study investigating the diachronic interaction of word order and subject case in Icelandic.